Loan defaults pose a significant financial risk to lending institutions. Identifying potential defaulters before issuing loans is critical for maintaining financial health, reducing losses, and ensuring responsible lending. As the lending industry becomes increasingly data-driven, leveraging historical loan and customer data for predictive insights can dramatically improve decision-making and risk assessment.
The goal is to develop a robust machine learning model that accurately predicts whether a loan applicant is likely to default. This will enable the institution to make informed lending decisions, minimize risk exposure, and optimize its credit portfolio.
Which customer attributes are the strongest indicators of loan default risk?
What preprocessing and modeling approaches yield the most accurate and generalizable predictions?
How can we balance false positives and false negatives, especially given the business cost of missing defaulters?
This is a binary classification problem where the target variable BAD indicates whether a customer defaulted (1) or not (0). Using historical customer and loan data, we aim to train and evaluate various machine learning models to classify new loan applicants as high or low risk, thereby supporting strategic credit decision-making.
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request. (HomeImp = home improvement, DebtCon= debt consolidation which means taking out a new loan to pay off other liabilities and consumer debts)
JOB: The type of job that loan applicant has such as manager, self, etc.
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all your monthly debt payments divided by your gross monthly income. This number is one way lenders measure your ability to manage the monthly payments to repay the money you plan to borrow.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, ConfusionMatrixDisplay
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import warnings
data = pd.read_csv("hmeq.csv")
df = data.copy()
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null int64 2 MORTDUE 5442 non-null float64 3 VALUE 5848 non-null float64 4 REASON 5708 non-null object 5 JOB 5681 non-null object 6 YOJ 5445 non-null float64 7 DEROG 5252 non-null float64 8 DELINQ 5380 non-null float64 9 CLAGE 5652 non-null float64 10 NINQ 5450 non-null float64 11 CLNO 5738 non-null float64 12 DEBTINC 4693 non-null float64 dtypes: float64(9), int64(2), object(2) memory usage: 605.4+ KB
print('Shape:', df.shape)
df.head()
Shape: (5960, 13)
| BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
df.describe()
| BAD | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5960.000000 | 5960.000000 | 5442.000000 | 5848.000000 | 5445.000000 | 5252.000000 | 5380.000000 | 5652.000000 | 5450.000000 | 5738.000000 | 4693.000000 |
| mean | 0.199497 | 18607.969799 | 73760.817200 | 101776.048741 | 8.922268 | 0.254570 | 0.449442 | 179.766275 | 1.186055 | 21.296096 | 33.779915 |
| std | 0.399656 | 11207.480417 | 44457.609458 | 57385.775334 | 7.573982 | 0.846047 | 1.127266 | 85.810092 | 1.728675 | 10.138933 | 8.601746 |
| min | 0.000000 | 1100.000000 | 2063.000000 | 8000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.524499 |
| 25% | 0.000000 | 11100.000000 | 46276.000000 | 66075.500000 | 3.000000 | 0.000000 | 0.000000 | 115.116702 | 0.000000 | 15.000000 | 29.140031 |
| 50% | 0.000000 | 16300.000000 | 65019.000000 | 89235.500000 | 7.000000 | 0.000000 | 0.000000 | 173.466667 | 1.000000 | 20.000000 | 34.818262 |
| 75% | 0.000000 | 23300.000000 | 91488.000000 | 119824.250000 | 13.000000 | 0.000000 | 0.000000 | 231.562278 | 2.000000 | 26.000000 | 39.003141 |
| max | 1.000000 | 89900.000000 | 399550.000000 | 855909.000000 | 41.000000 | 10.000000 | 15.000000 | 1168.233561 | 17.000000 | 71.000000 | 203.312149 |
def check_missing_values(data):
'''
Prints the number of features with missing values, their percentage
in the total dataset, and visualizes these missing values.
---
### Parameters
data <DataFrame>: DataFrame with the data to be checked for missing values
'''
# Calculate the number of missing values per column and filter columns that have missing values
count = data.isnull().sum()
count = count[count > 0]
# Early exit if no missing values are found
if count.empty:
print("No missing values found in the dataset.")
return
# Calculate the percentage of missing values
percentage = (count / len(data)) * 100
# Creating a DataFrame to display count and percentage of missing values
result = pd.DataFrame({'Count': count, 'Percentage %': percentage.round(2)})
# Print results
print(f"{result.shape[0]} columns have missing values")
print('-'*30)
print(result)
# Plotting the missing values
msno.bar(data, figsize=(10, 5), fontsize=12, color="skyblue")
check_missing_values(df)
11 columns have missing values
------------------------------
Count Percentage %
MORTDUE 518 8.69
VALUE 112 1.88
REASON 252 4.23
JOB 279 4.68
YOJ 515 8.64
DEROG 708 11.88
DELINQ 580 9.73
CLAGE 308 5.17
NINQ 510 8.56
CLNO 222 3.72
DEBTINC 1267 21.26
# Define numerical and categorical columns based on your dataset
num_cols = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
cat_cols = ['REASON', 'JOB']
# Create a skewness summary DataFrame
skew_summary = pd.DataFrame({
'Feature': num_cols,
'Skewness': [df[col].skew().round(2) for col in num_cols]
}).sort_values(by='Skewness', ascending=False)
skew_summary
| Feature | Skewness | |
|---|---|---|
| 4 | DEROG | 5.32 |
| 5 | DELINQ | 4.02 |
| 2 | VALUE | 3.05 |
| 9 | DEBTINC | 2.85 |
| 7 | NINQ | 2.62 |
| 0 | LOAN | 2.02 |
| 1 | MORTDUE | 1.81 |
| 6 | CLAGE | 1.34 |
| 3 | YOJ | 0.99 |
| 8 | CLNO | 0.78 |
All imputed features were found to be moderately to heavily skewed (skewness > 0.5), making the median a more robust choice over the mean to prevent distortion from outliers
There are a lot of missing values, in EDA there are going to be checks if data is skewed median will be used for inputing missing values, if not mean for numeric variables. For categorical variables mode will be applied.
Leading Questions:
def histogram_boxplot(data, feature, figsize=(12, 6), kde=False, bins=30):
"""
Combined boxplot and histogram for univariate analysis.
Parameters:
-----------
data : pd.DataFrame
The dataset to plot from.
feature : str
Column name to plot.
figsize : tuple
Size of the entire figure.
kde : bool
Whether to include a Kernel Density Estimate in the histogram.
bins : int or None
Number of histogram bins.
"""
fig, (ax_box, ax_hist) = plt.subplots(
nrows=2,
sharex=True,
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize
)
# Boxplot (top)
sns.boxplot(
data=data, x=feature, ax=ax_box, showmeans=True, color="skyblue"
)
ax_box.set(xlabel="")
# Histogram (bottom)
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist, bins=bins, color="steelblue"
)
ax_hist.axvline(data[feature].mean(), color="green", linestyle="--", label="Mean")
ax_hist.axvline(data[feature].median(), color="black", linestyle="-", label="Median")
ax_hist.legend()
plt.suptitle(f"Boxplot & Histogram for {feature}", fontsize=14)
plt.tight_layout()
plt.show()
for col in num_cols:
histogram_boxplot(df, col, kde=True)
This detailed histogram + boxplot analysis revealed key patterns in the dataset:
DEROG, DELINQ, and DEBTINC exhibit patterns linked to credit risk and will be central to modeling.# Calculate default proportions
default_counts = df['BAD'].value_counts()
labels = ['No Default (0)', 'Default (1)']
# Plot pie chart
default_counts.plot(
kind='pie',
labels=labels,
explode=[0.05, 0],
autopct='%1.1f%%',
startangle=90,
colors=['#66b3ff', '#ff9999'],
wedgeprops={'edgecolor': 'black'}
)
plt.title("Loan Repayment Status (BAD Variable)", fontsize=14)
plt.ylabel('')
plt.axis('equal')
plt.show()
About ~20% of applicants defaulted (BAD = 1) and ~80% repaid. There is some class imbalance, but it's manageable without special sampling techniques yet. Recall and AUC will be more informative than accuracy.
# Value counts for REASON
reason_counts = df['REASON'].value_counts()
labels = reason_counts.index.tolist()
colors = ['#66b3ff', '#ff9999'] # Customize based on # of categories
# Pie chart
reason_counts.plot(
kind='pie',
labels=labels,
autopct='%1.1f%%',
explode=[0.05, 0],
startangle=90,
colors=colors,
wedgeprops={'edgecolor': 'black'}
)
plt.title("Loan Reason Distribution", fontsize=14)
plt.ylabel('')
plt.axis('equal') # Equal aspect ratio for perfect circle
plt.show()
Majority of applicants ~69% requested loans for debt consolidation, which may indicate higher financial stress or prior credit usage.
Only ~31% applied for home improvement, which could reflect a stronger financial position, as they are investing in their property.
# JOB -> Manager, Office, Sales, Self-employed, Professional/Executive, Other
print("JOB value counts:")
print(df['JOB'].value_counts())
sns.countplot(x='JOB', data=df)
plt.title("Job Type Distribution")
plt.xticks(rotation=45)
plt.show()
JOB value counts: JOB Other 2388 ProfExe 1276 Office 948 Mgr 767 Self 193 Sales 109 Name: count, dtype: int64
# Plot 1: JOB vs BAD with % annotations
plt.figure(figsize=(8, 4))
ax = sns.countplot(x='JOB', hue='BAD', data=df)
# Title and labels
plt.title("Default Rate by Job Type")
plt.xlabel("Job Category")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="BAD", labels=["No Default (0)", "Default (1)"])
# Add % labels only for non-zero bars
for p in ax.patches:
count = p.get_height()
if count > 0:
x = p.get_x() + p.get_width() / 2.0
label = f"{100 * count / len(df):.1f}%"
ax.annotate(label, (x, count), ha='center', va='bottom', fontsize=9)
plt.show()
# Plot 2: REASON vs BAD with % annotations
plt.figure(figsize=(8, 4))
ax = sns.countplot(x='REASON', hue='BAD', data=df)
# Title and labels
plt.title("Default Rate by Loan Reason")
plt.xlabel("Loan Reason")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.legend(title="BAD", labels=["No Default (0)", "Default (1)"])
# Add % labels only for non-zero bars
for p in ax.patches:
count = p.get_height()
if count > 0:
x = p.get_x() + p.get_width() / 2.0
label = f"{100 * count / len(df):.1f}%"
ax.annotate(label, (x, count), ha='center', va='bottom', fontsize=9)
plt.show()
for col in num_cols:
plt.figure(figsize=(7,4))
sns.boxplot(data=df, x='BAD', y=col, hue='BAD', palette='Set1', dodge=False, legend=False)
plt.title(f"{col} by Loan Default (BAD)")
plt.xlabel("BAD (0 = No Default, 1 = Default)")
plt.ylabel(col)
plt.show()
# Build correlation matrix for numerical columns
corr = df[num_cols].corr()
# Plot the heatmap
plt.figure(figsize=(16,12))
sns.heatmap(corr, cmap='coolwarm', vmax=1, vmin=-1,
annot=True, fmt=".2f",
xticklabels=corr.columns,
yticklabels=corr.columns)
plt.title('Correlation Matrix for Numerical Columns')
plt.show()
To better understand relationships between numerical variables, especially to detect multicollinearity or redundancy.
sns.pairplot(df, hue='BAD', diag_kind='kde', palette='Set1')
plt.suptitle('Pairplot of Features Colored by Default Status', y=1.02)
plt.show()
# Fill CATEGORICAL features with 'Other'
df['REASON'] = df['REASON'].fillna('Other')
df['JOB'] = df['JOB'].fillna('Other')
# Fill CREDIT HISTORY fields with 0 (assuming no issues)
for col in ['DEROG', 'DELINQ', 'NINQ']:
df[col] = df[col].fillna(0)
# Fill skewed NUMERIC features with median to preserve central tendency and avoid outlier distortion
for col in ['MORTDUE', 'VALUE', 'CLAGE', 'CLNO', 'DEBTINC']:
df[col] = df[col].fillna(df[col].median())
# Group-wise median imputation for YOJ based on JOB
df['YOJ'] = df.groupby('JOB')['YOJ'].transform(lambda x: x.fillna(x.median()))
print(df.isnull().sum()[df.isnull().sum() > 0])
Series([], dtype: int64)
def cap_outliers_iqr(df, columns, factor=1.5):
for col in columns:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - factor * IQR
upper_bound = Q3 + factor * IQR
df[col] = np.where(df[col] > upper_bound, upper_bound,
np.where(df[col] < lower_bound, lower_bound, df[col]))
return df
# Columns to apply IQR capping
cap_cols = ['LOAN', 'MORTDUE', 'VALUE', 'DEBTINC', 'CLNO', 'NINQ', 'DEROG', 'DELINQ']
df = cap_outliers_iqr(df, cap_cols)
What are the the most important observations and insights from the data based on the EDA performed?
# Check for outliers again
for col in cap_cols:
plt.figure(figsize=(7,4))
sns.boxplot(data=df, y=col)
plt.title(f"Boxplot of {col} after IQR Capping")
plt.show()
df.isnull().sum()
BAD 0 LOAN 0 MORTDUE 0 VALUE 0 REASON 0 JOB 0 YOJ 0 DEROG 0 DELINQ 0 CLAGE 0 NINQ 0 CLNO 0 DEBTINC 0 dtype: int64
print("LOAN mean:", df['LOAN'].mean())
print("LOAN std:", df['LOAN'].std())
LOAN mean: 18051.895973154362 LOAN std: 9252.565294021877
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5960 entries, 0 to 5959 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 BAD 5960 non-null int64 1 LOAN 5960 non-null float64 2 MORTDUE 5960 non-null float64 3 VALUE 5960 non-null float64 4 REASON 5960 non-null object 5 JOB 5960 non-null object 6 YOJ 5960 non-null float64 7 DEROG 5960 non-null float64 8 DELINQ 5960 non-null float64 9 CLAGE 5960 non-null float64 10 NINQ 5960 non-null float64 11 CLNO 5960 non-null float64 12 DEBTINC 5960 non-null float64 dtypes: float64(10), int64(1), object(2) memory usage: 605.4+ KB
# Split the data into features and target variable
X = df.drop(columns='BAD')
y = df['BAD']
# Ensure enough class diversity
print("Class distribution:\n", y.value_counts())
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=44, stratify=y)
# Check the shape of the splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Class distribution: BAD 0 4771 1 1189 Name: count, dtype: int64 X_train shape: (4768, 12) X_test shape: (1192, 12) y_train shape: (4768,) y_test shape: (1192,)
num_cols = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ',
'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
print("Train class distribution:\n", y_train.value_counts())
print("Test class distribution:\n", y_test.value_counts())
Train class distribution: BAD 0 3817 1 951 Name: count, dtype: int64 Test class distribution: BAD 0 954 1 238 Name: count, dtype: int64
# Define column types
cat_cols = ['REASON', 'JOB']
num_cols = [col for col in X_train.columns if col not in cat_cols]
# Preprocessing: scaling + encoding
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), num_cols),
('cat', OrdinalEncoder(), cat_cols)
])
# Build pipeline
logreg_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=44))
])
# Fit model
logreg_pipeline.fit(X_train, y_train)
# Predict
y_pred = logreg_pipeline.predict(X_test)
y_proba = logreg_pipeline.predict_proba(X_test)[:, 1]
y_train_pred = logreg_pipeline.predict(X_train)
y_train_proba = logreg_pipeline.predict_proba(X_train)[:, 1]
from sklearn.metrics import confusion_matrix
# Evaluation on training set
print("Classification Report (Train):")
print(classification_report(y_train, y_train_pred))
# Round and print metrics
print("Accuracy:", round(accuracy_score(y_train, y_train_pred), 2))
print("Precision:", round(precision_score(y_train, y_train_pred), 2))
print("Recall:", round(recall_score(y_train, y_train_pred), 2))
print("F1 Score:", round(f1_score(y_train, y_train_pred), 2))
print("AUC Score:", round(roc_auc_score(y_train, y_train_proba), 2))
# Confusion Matrix with Labels
cm_train = confusion_matrix(y_train, y_train_pred)
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Greens',
xticklabels=['No Default', 'Default'],
yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Logistic Regression (Train)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr_train, tpr_train, label=f"AUC = {roc_auc_score(y_train, y_train_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression (Train)')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
Classification Report (Train):
precision recall f1-score support
0 0.88 0.63 0.74 3817
1 0.31 0.65 0.42 951
accuracy 0.64 4768
macro avg 0.59 0.64 0.58 4768
weighted avg 0.77 0.64 0.67 4768
Accuracy: 0.64
Precision: 0.31
Recall: 0.65
F1 Score: 0.42
AUC Score: 0.7
# Evaluation on test set
print("Classification Report (Test):")
print(classification_report(y_test, y_pred))
# Round and print individual metrics
print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print("Precision:", round(precision_score(y_test, y_pred), 2))
print("Recall:", round(recall_score(y_test, y_pred), 2))
print("F1 Score:", round(f1_score(y_test, y_pred), 2))
print("AUC Score:", round(roc_auc_score(y_test, y_proba), 2))
# Confusion Matrix with Labels
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Logistic Regression (Test)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression (Test)')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Save metrics for comparison table later
logreg_metrics = {
'model': 'Logistic Regression',
'accuracy': round(accuracy_score(y_test, y_pred), 2),
'precision': round(precision_score(y_test, y_pred), 2),
'recall': round(recall_score(y_test, y_pred), 2),
'f1': round(f1_score(y_test, y_pred), 2),
'auc': round(roc_auc_score(y_test, y_proba), 2)
}
Classification Report (Test):
precision recall f1-score support
0 0.88 0.63 0.73 954
1 0.31 0.65 0.42 238
accuracy 0.64 1192
macro avg 0.59 0.64 0.58 1192
weighted avg 0.76 0.64 0.67 1192
Accuracy: 0.64
Precision: 0.31
Recall: 0.65
F1 Score: 0.42
AUC Score: 0.69
# Columns setup (same as before)
cat_cols = ['REASON', 'JOB']
num_cols = [col for col in X_train.columns if col not in cat_cols]
# Preprocessing: only encoding (no scaling needed for tree)
tree_preprocessor = ColumnTransformer(transformers=[
('cat', OrdinalEncoder(), cat_cols)
], remainder='passthrough')
# Decision Tree Model Pipeline
dt_pipeline = Pipeline(steps=[
('preprocessor', tree_preprocessor),
('classifier', DecisionTreeClassifier(class_weight='balanced', random_state=42))
])
# Train model
dt_pipeline.fit(X_train, y_train)
# Predict
y_pred_dt = dt_pipeline.predict(X_test)
y_proba_dt = dt_pipeline.predict_proba(X_test)[:, 1]
# Predict on training as well for comparison later
y_train_pred_dt = dt_pipeline.predict(X_train)
y_train_proba_dt = dt_pipeline.predict_proba(X_train)[:, 1]
# Test set evaluation
print("Classification Report (Test - Decision Tree):")
print(classification_report(y_test, y_pred_dt))
print("Accuracy:", round(accuracy_score(y_test, y_pred_dt), 2))
print("Precision:", round(precision_score(y_test, y_pred_dt), 2))
print("Recall:", round(recall_score(y_test, y_pred_dt), 2))
print("F1 Score:", round(f1_score(y_test, y_pred_dt), 2))
print("AUC Score:", round(roc_auc_score(y_test, y_proba_dt), 2))
# Confusion Matrix
cm_dt = confusion_matrix(y_test, y_pred_dt)
sns.heatmap(cm_dt, annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default', 'Default'],
yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Decision Tree (Test)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_proba_dt)
plt.figure(figsize=(8, 6))
plt.plot(fpr_dt, tpr_dt, label=f"AUC = {roc_auc_score(y_test, y_proba_dt):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree (Test)')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Save metrics
dt_metrics = {
'model': 'Decision Tree',
'accuracy': round(accuracy_score(y_test, y_pred_dt), 2),
'precision': round(precision_score(y_test, y_pred_dt), 2),
'recall': round(recall_score(y_test, y_pred_dt), 2),
'f1': round(f1_score(y_test, y_pred_dt), 2),
'auc': round(roc_auc_score(y_test, y_proba_dt), 2)
}
Classification Report (Test - Decision Tree):
precision recall f1-score support
0 0.89 0.91 0.90 954
1 0.60 0.54 0.57 238
accuracy 0.84 1192
macro avg 0.75 0.72 0.73 1192
weighted avg 0.83 0.84 0.83 1192
Accuracy: 0.84
Precision: 0.6
Recall: 0.54
F1 Score: 0.57
AUC Score: 0.72
# Preprocessing
cat_cols = ['REASON', 'JOB']
num_cols = [col for col in X_train.columns if col not in cat_cols]
preprocessor = ColumnTransformer(transformers=[
('num', StandardScaler(), num_cols),
('cat', OrdinalEncoder(), cat_cols)
])
# Classifier
dt = DecisionTreeClassifier(random_state=42, class_weight='balanced')
# Pipeline
dt_pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', dt)
])
# Parameter grid
param_grid = {
'classifier__max_depth': [3, 5, 7, 10, None],
'classifier__min_samples_leaf': [1, 5, 10, 20],
'classifier__min_samples_split': [2, 5, 10],
'classifier__criterion': ['gini', 'entropy']
}
# Grid Search with Recall as scoring
grid_search = GridSearchCV(dt_pipeline, param_grid, cv=5, scoring='recall', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best Model
best_dt = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)
Best Parameters: {'classifier__criterion': 'entropy', 'classifier__max_depth': None, 'classifier__min_samples_leaf': 20, 'classifier__min_samples_split': 2}
# Predict on training set
y_train_pred_dt = best_dt.predict(X_train)
y_train_proba_dt = best_dt.predict_proba(X_train)[:, 1]
# Evaluation on training set
print("Classification Report (Train - Tuned Decision Tree):")
print(classification_report(y_train, y_train_pred_dt))
# Round and print metrics
print("Accuracy:", round(accuracy_score(y_train, y_train_pred_dt), 2))
print("Precision:", round(precision_score(y_train, y_train_pred_dt), 2))
print("Recall:", round(recall_score(y_train, y_train_pred_dt), 2))
print("F1 Score:", round(f1_score(y_train, y_train_pred_dt), 2))
print("AUC Score:", round(roc_auc_score(y_train, y_train_proba_dt), 2))
# Confusion Matrix with Labels
cm_train = confusion_matrix(y_train, y_train_pred_dt)
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Greens',
xticklabels=['No Default', 'Default'],
yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Decision Tree Tuned - Train")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_proba_dt)
plt.figure(figsize=(8, 6))
plt.plot(fpr_train, tpr_train, label=f"AUC = {roc_auc_score(y_train, y_train_proba_dt):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree Tuned - Train')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
Classification Report (Train - Tuned Decision Tree):
precision recall f1-score support
0 0.97 0.82 0.89 3817
1 0.56 0.91 0.69 951
accuracy 0.84 4768
macro avg 0.77 0.86 0.79 4768
weighted avg 0.89 0.84 0.85 4768
Accuracy: 0.84
Precision: 0.56
Recall: 0.91
F1 Score: 0.69
AUC Score: 0.94
# Predict on test set
y_pred_dt = best_dt.predict(X_test)
y_proba_dt = best_dt.predict_proba(X_test)[:, 1]
# Classification report
print("Classification Report (Test - Tuned Decision Tree):")
print(classification_report(y_test, y_pred_dt))
print("Accuracy:", round(accuracy_score(y_test, y_pred_dt), 2))
print("Precision:", round(precision_score(y_test, y_pred_dt), 2))
print("Recall:", round(recall_score(y_test, y_pred_dt), 2))
print("F1 Score:", round(f1_score(y_test, y_pred_dt), 2))
print("AUC Score:", round(roc_auc_score(y_test, y_proba_dt), 2))
# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_dt), annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default', 'Default'],
yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Decision Tree Tuned - Test")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba_dt)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba_dt):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Decision Tree Tuned - Test')
plt.legend()
plt.grid(True)
plt.show()
# Save metrics
dt_tuned_metrics = {
'model': 'Tuned Decision Tree',
'accuracy': round(accuracy_score(y_test, y_pred_dt), 2),
'precision': round(precision_score(y_test, y_pred_dt), 2),
'recall': round(recall_score(y_test, y_pred_dt), 2),
'f1': round(f1_score(y_test, y_pred_dt), 2),
'auc': round(roc_auc_score(y_test, y_proba_dt), 2)
}
Classification Report (Test - Tuned Decision Tree):
precision recall f1-score support
0 0.95 0.78 0.86 954
1 0.49 0.84 0.62 238
accuracy 0.79 1192
macro avg 0.72 0.81 0.74 1192
weighted avg 0.86 0.79 0.81 1192
Accuracy: 0.79
Precision: 0.49
Recall: 0.84
F1 Score: 0.62
AUC Score: 0.87
# feature importance
importances = best_dt.named_steps['classifier'].feature_importances_
feature_names = num_cols + cat_cols
importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importances_df = importances_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importances_df)
plt.title("Feature Importance - Tuned Decision Tree")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
# Step 1: Column setup
cat_cols = ['REASON', 'JOB']
num_cols = [col for col in X_train.columns if col not in cat_cols]
# Step 2: Preprocessing - encoding only
rf_preprocessor = ColumnTransformer(transformers=[
('cat', OrdinalEncoder(), cat_cols)
], remainder='passthrough')
# Step 3: Model pipeline
rf_pipeline = Pipeline(steps=[
('preprocessor', rf_preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42))
])
# Step 4: Train model
rf_pipeline.fit(X_train, y_train)
# Step 5: Predict on train/test
y_pred_rf = rf_pipeline.predict(X_test)
y_proba_rf = rf_pipeline.predict_proba(X_test)[:, 1]
y_train_pred_rf = rf_pipeline.predict(X_train)
y_train_proba_rf = rf_pipeline.predict_proba(X_train)[:, 1]
# Test Set Evaluation
print("Classification Report (Test - Random Forest):")
print(classification_report(y_test, y_pred_rf))
print("Accuracy:", round(accuracy_score(y_test, y_pred_rf), 2))
print("Precision:", round(precision_score(y_test, y_pred_rf), 2))
print("Recall:", round(recall_score(y_test, y_pred_rf), 2))
print("F1 Score:", round(f1_score(y_test, y_pred_rf), 2))
print("AUC Score:", round(roc_auc_score(y_test, y_proba_rf), 2))
# Confusion Matrix - Test
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Random Forest (Test)")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve - Test
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)
plt.figure(figsize=(8, 6))
plt.plot(fpr_rf, tpr_rf, label=f'AUC = {roc_auc_score(y_test, y_proba_rf):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guess')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Random Forest (Test)")
plt.legend()
plt.grid(True)
plt.show()
# Save metrics for comparison table later
rf_metrics = {
'model': 'Random Forest',
'accuracy': round(accuracy_score(y_test, y_pred_rf), 2),
'precision': round(precision_score(y_test, y_pred_rf), 2),
'recall': round(recall_score(y_test, y_pred_rf), 2),
'f1': round(f1_score(y_test, y_pred_rf), 2),
'auc': round(roc_auc_score(y_test, y_proba_rf), 2)
}
Classification Report (Test - Random Forest):
precision recall f1-score support
0 0.91 0.96 0.93 954
1 0.80 0.62 0.70 238
accuracy 0.89 1192
macro avg 0.85 0.79 0.82 1192
weighted avg 0.89 0.89 0.89 1192
Accuracy: 0.89
Precision: 0.8
Recall: 0.62
F1 Score: 0.7
AUC Score: 0.94
# Preprocessing (reusing structure from earlier)
rf_preprocessor = ColumnTransformer(transformers=[
('cat', OrdinalEncoder(), cat_cols)
], remainder='passthrough')
# Base classifier (can omit n_estimators here to grid it)
rf = RandomForestClassifier(random_state=42)
# Pipeline
rf_pipeline = Pipeline(steps=[
('preprocessor', rf_preprocessor),
('classifier', rf)
])
# Parameter grid based on best practices
rf_param_grid = {
'classifier__n_estimators': [50, 100, 150, 200, 250, 300],
'classifier__max_depth': [5, 10, 15, None],
'classifier__min_samples_split': [2, 5, 10],
'classifier__min_samples_leaf': [1, 2, 4],
'classifier__max_features': ['sqrt', 'log2', None],
'classifier__class_weight': [None, 'balanced']
}
# Grid search with recall scoring (can swap to f1 or auc later)
grid_search_rf = GridSearchCV(
estimator=rf_pipeline,
param_grid=rf_param_grid,
scoring='recall', # or 'f1' or 'roc_auc'
cv=5,
n_jobs=-1,
verbose=1
)
# Run it
grid_search_rf.fit(X_train, y_train)
# Best model
best_rf = grid_search_rf.best_estimator_
print("Best Parameters:", grid_search_rf.best_params_)
Fitting 5 folds for each of 1296 candidates, totalling 6480 fits
Best Parameters: {'classifier__class_weight': 'balanced', 'classifier__max_depth': 5, 'classifier__max_features': 'sqrt', 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 50}
# Predict
y_pred_best_rf = best_rf.predict(X_test)
y_proba_best_rf = best_rf.predict_proba(X_test)[:, 1]
# Metrics
print("Classification Report (Tuned Random Forest - Test):")
print(classification_report(y_test, y_pred_best_rf))
print("Accuracy:", round(accuracy_score(y_test, y_pred_best_rf), 2))
print("Precision:", round(precision_score(y_test, y_pred_best_rf), 2))
print("Recall:", round(recall_score(y_test, y_pred_best_rf), 2))
print("F1 Score:", round(f1_score(y_test, y_pred_best_rf), 2))
print("AUC Score:", round(roc_auc_score(y_test, y_proba_best_rf), 2))
# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred_best_rf), annot=True, fmt='d', cmap='Blues',
xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
plt.title("Confusion Matrix - Tuned Random Forest")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba_best_rf)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba_best_rf):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random Guess')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Tuned Random Forest")
plt.legend()
plt.grid(True)
plt.show()
# Save metrics for comparison table later
rf_tuned_metrics = {
'model': 'Tuned Random Forest',
'accuracy': round(accuracy_score(y_test, y_pred_best_rf), 2),
'precision': round(precision_score(y_test, y_pred_best_rf), 2),
'recall': round(recall_score(y_test, y_pred_best_rf), 2),
'f1': round(f1_score(y_test, y_pred_best_rf), 2),
'auc': round(roc_auc_score(y_test, y_proba_best_rf), 2)
}
Classification Report (Tuned Random Forest - Test):
precision recall f1-score support
0 0.95 0.78 0.86 954
1 0.49 0.83 0.61 238
accuracy 0.79 1192
macro avg 0.72 0.81 0.74 1192
weighted avg 0.86 0.79 0.81 1192
Accuracy: 0.79
Precision: 0.49
Recall: 0.83
F1 Score: 0.61
AUC Score: 0.88
# feature importance
importances = best_rf.named_steps['classifier'].feature_importances_
feature_names = num_cols + cat_cols
importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
importances_df = importances_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importances_df)
plt.title("Feature Importance - Tuned Random Forest")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
def evaluate_thresholds(y_true, y_proba, thresholds=np.arange(0.1, 0.91, 0.05)):
results = []
for thresh in thresholds:
y_pred_thresh = (y_proba >= thresh).astype(int)
precision = precision_score(y_true, y_pred_thresh)
recall = recall_score(y_true, y_pred_thresh)
f1 = f1_score(y_true, y_pred_thresh)
acc = accuracy_score(y_true, y_pred_thresh)
results.append({
'Threshold': round(thresh, 2),
'Precision': round(precision, 2),
'Recall': round(recall, 2),
'F1': round(f1, 2),
'Accuracy': round(acc, 2)
})
return pd.DataFrame(results)
# Example for Tuned Random Forest
threshold_metrics_rf = evaluate_thresholds(y_test, y_proba_rf)
# Example for Tuned Decision Tree
threshold_metrics_dt = evaluate_thresholds(y_test, y_proba_dt)
def plot_threshold_curves(metrics_df, model_name="Model"):
plt.figure(figsize=(10,6))
plt.plot(metrics_df['Threshold'], metrics_df['Precision'], label='Precision')
plt.plot(metrics_df['Threshold'], metrics_df['Recall'], label='Recall')
plt.plot(metrics_df['Threshold'], metrics_df['F1'], label='F1 Score')
plt.plot(metrics_df['Threshold'], metrics_df['Accuracy'], label='Accuracy')
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.title(f"Threshold vs Metrics - {model_name}")
plt.legend()
plt.grid(True)
plt.show()
# Plot for RF
plot_threshold_curves(threshold_metrics_rf, model_name="Tuned Random Forest")
# Plot for DT
plot_threshold_curves(threshold_metrics_dt, model_name="Tuned Decision Tree")
def optimal_threshold(probs, y_true, model_name="Model"):
thresholds = np.arange(0.1, 0.91, 0.01)
results = []
for thresh in thresholds:
y_pred_thresh = (probs >= thresh).astype(int)
recall = recall_score(y_true, y_pred_thresh)
if recall >= 0.80: # constraint
precision = precision_score(y_true, y_pred_thresh)
f1 = f1_score(y_true, y_pred_thresh)
acc = accuracy_score(y_true, y_pred_thresh)
results.append({
'threshold': thresh,
'precision': precision,
'recall': recall,
'f1_score': f1,
'accuracy': acc
})
results_df = pd.DataFrame(results)
if results_df.empty:
print(f"No thresholds found for {model_name} with recall >= 0.80")
return None, None
best = results_df.loc[results_df['f1_score'].idxmax()]
print(f"🔍 {model_name} | Best Threshold (recall ≥ 0.80): {best['threshold']:.2f}")
print(f"Precision: {best['precision']:.2f}, Recall: {best['recall']:.2f}, F1: {best['f1_score']:.2f}, Accuracy: {best['accuracy']:.2f}")
return best['threshold'], results_df
rf_probs = best_rf.predict_proba(X_test)[:, 1] # Probabilities for class 1 (default)
best_thresh_rf, rf_thresh_df = optimal_threshold(rf_probs, y_test, model_name="Tuned Random Forest")
🔍 Tuned Random Forest | Best Threshold (recall ≥ 0.80): 0.51 Precision: 0.53, Recall: 0.81, F1: 0.64, Accuracy: 0.82
dt_probs = best_dt.predict_proba(X_test)[:, 1] # Probabilities for class 1 (default)
best_thresh_dt, dt_thresh_df = optimal_threshold(dt_probs, y_test, model_name="Tuned Decision Tree")
🔍 Tuned Decision Tree | Best Threshold (recall ≥ 0.80): 0.61 Precision: 0.56, Recall: 0.82, F1: 0.67, Accuracy: 0.83
Threshold Adjustment Decision
I explored probability thresholds to improve the recall-precision balance. While thresholds such as 0.61 showed potential to increase recall, we retained the default threshold (0.5) for model balance.
# Combine saved metrics
all_metrics = pd.DataFrame([
logreg_metrics,
dt_metrics,
dt_tuned_metrics,
rf_metrics,
rf_tuned_metrics
])
# Set model names as index
all_metrics.set_index('model', inplace=True)
# Plotting
plt.figure(figsize=(12, 6))
all_metrics.plot(kind='bar', figsize=(12, 6), colormap='Set2')
plt.title('Model Performance Comparison (Test Set)', fontsize=16)
plt.ylabel('Score')
plt.xticks(rotation=45)
plt.ylim(0, 1)
plt.legend(loc='lower right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
<Figure size 1200x600 with 0 Axes>
# Define steps
steps = [
"Load Data",
"Initial EDA",
"Missing Value Treatment",
"Outlier Handling & Capping",
"Feature Encoding & Scaling",
"Train-Test Split",
"Modeling:\n(LogReg / DT / RF)",
"Hyperparameter Tuning",
"Threshold Optimization\n(Optional)",
"Evaluation:\n(Conf. Matrix, AUC, F1)",
"Model Comparison",
"Final Recommendation"
]
# Position with more space between steps
x = list(range(1, len(steps)*2, 2)) # 1, 3, 5, ..., 23
y = [1] * len(x)
plt.figure(figsize=(24, 3))
sns.set_style("whitegrid")
# Plot step boxes
for i in range(len(steps)):
plt.text(x[i], y[i], steps[i],
bbox=dict(facecolor='skyblue', edgecolor='black', boxstyle='round,pad=0.5'),
ha='center', va='center', fontsize=10)
# Draw arrows between them
for i in range(len(steps)-1):
plt.arrow(x[i]+0.5, y[i], x[i+1] - x[i] - 1, 0,
head_width=0.1, head_length=0.3, fc='gray', ec='gray')
# Format plot
plt.xlim(0, max(x)+1)
plt.ylim(0.5, 1.5)
plt.axis('off')
plt.title("Modeling Process Flowchart", fontsize=16)
plt.tight_layout()
plt.show()
I evaluated five models to classify loan applicants as likely defaulters (BAD=1) or not (BAD=0):
Key metrics such as Recall, Precision, F1 Score, and AUC were used to assess performance.
| Model | Accuracy | Recall | Precision | F1 Score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.64 | 0.65 | 0.31 | 0.42 | 0.69 |
| Decision Tree | 0.84 | 0.54 | 0.60 | 0.57 | 0.72 |
| Tuned Decision Tree | 0.79 | 0.84 | 0.49 | 0.62 | 0.87 |
| Random Forest | 0.89 | 0.62 | 0.80 | 0.70 | 0.94 |
| Tuned Random Forest | 0.82 | 0.82 | 0.52 | 0.64 | 0.88 |
Tuned Random Forest and Tuned Decision Tree both significantly outperform the baseline in terms of Recall, a critical metric given the cost of missing defaulters.
From EDA and feature importance analysis, were derived several key insights:
DEROG, DELINQ, and CLAGE were top predictors of default.DEBTINC values significantly increased default risk.REASON = DebtCon).I propose adopting the Tuned Random Forest Classifier as the final solution.
This solution aligns with the business goal: minimize undetected defaulters while maintaining overall accuracy, thus supporting better lending decisions and credit risk management.